Use a private task-local RNG for kernel launch seeds #3161
Conversation
Force-pushed from a7bf4f3 to c1144c5
Not needed. The user can always put a |
# Task-local RNG used solely to seed kernel launches. Drawing launch seeds
# from a private RNG (rather than the task-global default) ensures that a kernel
# launch never perturbs the user-visible `rand()` stream. Lazily created per
# task; `Xoshiro()` seeds itself from system entropy without touching the
# default RNG. (JuliaGPU/CUDA.jl#2417)
Please de-LLM some of these comments: no need to refer to the previous state, #2417 was a TODO so not worth pointing to, etc.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3161 +/- ##
==========================================
- Coverage 16.33% 16.32% -0.02%
==========================================
Files 124 124
Lines 9875 9875
==========================================
- Hits 1613 1612 -1
- Misses 8262 8263 +1
☔ View full report in Codecov by Harness.
CUDA.jl Benchmarks
| Benchmark suite | Current: 88072e4 | Previous: aa47d7a | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 99401 ns | 98976 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 75801 ns | 75470 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1596618 ns | 1595788 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 140800 ns | 140419 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 653847 ns | 653444 ns | 1.00 |
| array/accumulate/Int64/1d | 118385 ns | 118155 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79035 ns | 78907 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1708144 ns | 1709506 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 153801 ns | 153939 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 959403 ns | 959330 ns | 1.00 |
| array/broadcast | 18271 ns | 18270 ns | 1.00 |
| array/construct | 1222.3 ns | 1198.4 ns | 1.02 |
| array/copy | 16408 ns | 16676 ns | 0.98 |
| array/copyto!/cpu_to_gpu | 212320 ns | 211135 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 279697 ns | 278832 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10291 ns | 10531 ns | 0.98 |
| array/iteration/findall/bool | 133390 ns | 131993 ns | 1.01 |
| array/iteration/findall/int | 147460 ns | 146745 ns | 1.00 |
| array/iteration/findfirst/bool | 111639 ns | 111631 ns | 1.00 |
| array/iteration/findfirst/int | 112010 ns | 111858 ns | 1.00 |
| array/iteration/findmin/1d | 65743 ns | 66902 ns | 0.98 |
| array/iteration/findmin/2d | 100431 ns | 100550 ns | 1.00 |
| array/iteration/logical | 191970 ns | 189124 ns | 1.02 |
| array/iteration/scalar | 64769 ns | 66015 ns | 0.98 |
| array/permutedims/2d | 49410 ns | 49598 ns | 1.00 |
| array/permutedims/3d | 50855 ns | 50240 ns | 1.01 |
| array/permutedims/4d | 50619 ns | 50411 ns | 1.00 |
| array/random/rand/Float32 | 11524 ns | 11982 ns | 0.96 |
| array/random/rand/Int64 | 24234 ns | 23515 ns | 1.03 |
| array/random/rand!/Float32 | 7935 ns | 8122 ns | 0.98 |
| array/random/rand!/Int64 | 20840 ns | 20501 ns | 1.02 |
| array/random/randn/Float32 | 34692 ns | 34458 ns | 1.01 |
| array/random/randn!/Float32 | 23818 ns | 24130 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 33469 ns | 33763 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1 | 38081 ns | 38228 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 49795 ns | 50249 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 55488 ns | 55439 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 67122 ns | 67163 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39352 ns | 40187 ns | 0.98 |
| array/reductions/mapreduce/Int64/dims=1 | 41051 ns | 40738 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 86289 ns | 86458 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57638 ns | 57724 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 82659 ns | 82751 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 33347 ns | 33392 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 38115 ns | 38091 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 49855 ns | 49963 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 55304 ns | 55491 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 68959 ns | 68898 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 38812 ns | 40022 ns | 0.97 |
| array/reductions/reduce/Int64/dims=1 | 40602 ns | 40672 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 86182 ns | 86301 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 57354 ns | 57311 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82719 ns | 82411 ns | 1.00 |
| array/reverse/1d | 16735 ns | 16807 ns | 1.00 |
| array/reverse/1dL | 67674 ns | 67676 ns | 1.00 |
| array/reverse/1dL_inplace | 65157 ns | 65187 ns | 1.00 |
| array/reverse/1d_inplace | 8217 ns | 9321.666666666666 ns | 0.88 |
| array/reverse/2d | 19752 ns | 19959 ns | 0.99 |
| array/reverse/2dL | 71559 ns | 71879 ns | 1.00 |
| array/reverse/2dL_inplace | 64905 ns | 65104 ns | 1.00 |
| array/reverse/2d_inplace | 9515 ns | 11067 ns | 0.86 |
| array/sorting/1d | 2650235 ns | 2655417 ns | 1.00 |
| array/sorting/2d | 1040002 ns | 1038734 ns | 1.00 |
| array/sorting/by | 3192735 ns | 3192232 ns | 1.00 |
| cuda/synchronization/context/auto | 1131.6 ns | 1131.5 ns | 1.00 |
| cuda/synchronization/context/blocking | 904.6666666666666 ns | 952.2173913043479 ns | 0.95 |
| cuda/synchronization/context/nonblocking | 5874 ns | 6097.8 ns | 0.96 |
| cuda/synchronization/stream/auto | 992.25 ns | 1004.5 ns | 0.99 |
| cuda/synchronization/stream/blocking | 811.2 ns | 825.6363636363636 ns | 0.98 |
| cuda/synchronization/stream/nonblocking | 5921 ns | 6045.333333333333 ns | 0.98 |
| integration/byval/reference | 143128 ns | 143141 ns | 1.00 |
| integration/byval/slices=1 | 145262 ns | 145110 ns | 1.00 |
| integration/byval/slices=2 | 283680 ns | 283495 ns | 1.00 |
| integration/byval/slices=3 | 422143 ns | 422045 ns | 1.00 |
| integration/cudadevrt | 101447 ns | 101557 ns | 1.00 |
| integration/volumerhs | 9078485 ns | 9077766 ns | 1.00 |
| kernel/indexing | 12466 ns | 12534 ns | 0.99 |
| kernel/indexing_checked | 13466 ns | 13291 ns | 1.01 |
| kernel/launch | 2040.2222222222222 ns | 2072.5555555555557 ns | 0.98 |
| kernel/occupancy | 699.2517006802722 ns | 716.9097744360902 ns | 0.98 |
| kernel/rand | 13666 ns | 13723 ns | 1.00 |
| latency/import | 3845165278 ns | 3841987489 ns | 1.00 |
| latency/precompile | 4621357772 ns | 4621684240 ns | 1.00 |
| latency/ttfp | 4491670603 ns | 4482964065 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Force-pushed from 899cf3b to 88072e4
Kernel launches used to draw their seed from the default RNG, perturbing the user-visible rand() stream. Use a lazily-created task-local Xoshiro instead. Includes a regression test.
Fixes #2417.
Problem
`make_seed(::HostKernel)` draws from Julia's default RNG on every kernel launch, so launching a kernel advances the user-visible `rand()` stream.
Fix
Following the direction suggested in the issue ("probably better to maintain a local RNG in CUDA.jl for launching kernels"), `make_seed` now draws from a private task-local `Xoshiro` that is lazily created in `task_local_storage()`.
`Xoshiro()` seeds itself from system entropy without touching the default RNG, so the launch RNG is fully decoupled from the default RNG in both directions: kernel launches no longer advance the `rand()` stream, and `Random.seed!(...)` no longer influences which seeds reach kernels (device-side reproducibility remains available via in-kernel `Random.seed!`, which is unchanged).
Task-local storage was chosen over a module-global RNG to stay thread-safe without locking on the launch path, mirroring how Julia's own default RNG is per-task. The `get!(factory, dict, key)` form matches the existing `devices()` helpers in `lib/cublas`/`lib/cusolver`, and the `::Random.Xoshiro` assertion keeps the launch path type-stable (0 allocations on the fast path). Device-side seeding (`make_seed(::DeviceKernel)` and `device/random.jl`'s Philox2x32) is untouched.
An alternative considered: a global atomic counter (Philox keys only need to be distinct, not random, so sequential keys would give independent streams). That would make seeds deterministic across runs, which is a bigger semantic change; happy to switch if that behavior is preferred.
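For readers without the diff at hand, the approach described above can be sketched in plain Julia with no GPU required. This is an illustration under assumptions, not the code from the PR: the storage key `LAUNCH_RNG_KEY`, the helper `launch_rng`, and the stand-in `make_seed_sketch` are hypothetical names, and the `UInt32` seed type is assumed.

```julia
using Random

# Hypothetical key for the task-local slot; the key used in the PR is not shown here.
const LAUNCH_RNG_KEY = :launch_rng_sketch

# Return the current task's private RNG, creating it on first use.
# `Xoshiro()` with no arguments seeds itself from system entropy.
function launch_rng()
    tls = task_local_storage()
    return get!(() -> Xoshiro(), tls, LAUNCH_RNG_KEY)::Xoshiro
end

# Stand-in for the host-side seed draw (assumed to be a UInt32 here).
make_seed_sketch() = rand(launch_rng(), UInt32)

# Reference: two draws from the default RNG with nothing in between.
Random.seed!(1234)
a = (rand(), rand())

# Same two draws, with a private seed draw interleaved.
Random.seed!(1234)
x = rand()
seed = make_seed_sketch()   # touches only the private RNG
y = rand()
b = (x, y)

println(a == b)
```

Because the seed comes from the private `Xoshiro`, the default stream seen by `rand()` should be the same in both runs, which is the property the regression test is described as checking.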
Testing
Added a regression test in `test/core/execution.jl` asserting the host RNG stream is identical with and without an interleaved launch, and that consecutive launches still receive distinct seeds. Verified on a Quadro RTX 6000 (sm_75, Julia 1.11.9): new testset passes (6/6) and the existing device-side RNG tests are unaffected (11/11).
Possible follow-up (out of scope here)
If deterministic launch seeds are ever wanted (e.g. for debugging), a `CUDA.seed_launch!(seed)` that reseeds the task-local launch RNG would slot in cleanly on top of this; happy to do that as a separate PR if there's interest.
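A rough sketch of what such a follow-up might look like, written as standalone Julia so it runs without CUDA.jl. Everything here is hypothetical: `seed_launch!` is only the name proposed above and does not exist in CUDA.jl as part of this PR, and `launch_rng`, `next_launch_seed`, and the storage key are illustrative stand-ins.

```julia
using Random

# Hypothetical storage key and accessor for the task-local launch RNG.
const SEEDABLE_LAUNCH_RNG_KEY = :seedable_launch_rng_sketch

launch_rng() = get!(() -> Xoshiro(), task_local_storage(), SEEDABLE_LAUNCH_RNG_KEY)::Xoshiro

# Hypothetical follow-up API: reseed the current task's launch RNG in place.
function seed_launch!(seed::Integer)
    Random.seed!(launch_rng(), seed)
    return nothing
end

# Stand-in for the host-side seed draw (UInt32 is an assumption).
next_launch_seed() = rand(launch_rng(), UInt32)

seed_launch!(42)
first_run = [next_launch_seed() for _ in 1:3]

seed_launch!(42)
second_run = [next_launch_seed() for _ in 1:3]

println(first_run == second_run)
```

Reseeding only the private RNG keeps the default `rand()` stream untouched, so deterministic launch seeds for debugging would not reintroduce the coupling this PR removes.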